{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The CIFAR-10 dataset: download and pre-processing\n",
    "============================\n",
    "\n",
    "The CIFAR-10 dataset can be downloaded from [www.cs.toronto.edu/~kriz/cifar.html](https://www.cs.toronto.edu/~kriz/cifar.html). It is distributed as a single compressed archive containing serialized numpy arrays. You can also download the archive directly from the following link:\n",
    "\n",
    "- [cifar-10-python.tar.gz](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz): test and training sets (163 MB)\n",
    "\n",
    "The archive contains the files *data_batch_1*, *data_batch_2*, ..., *data_batch_5*, as well as *test_batch* and *batches.meta*. Each of these files is a Python \"pickled\" object produced with cPickle. The pickle files can be easily deserialized using the following function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
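    "def unpickle(file_path):\n",
    "    import cPickle  #Python 2 module (this notebook runs on a Python 2 kernel)\n",
    "    #Open the file in binary mode and deserialize the pickled dictionary\n",
    "    with open(file_path, 'rb') as fo:\n",
    "        data_dict = cPickle.load(fo)\n",
    "    return data_dict"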
   ]
  },
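  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The files mentioned above first have to be extracted from the archive. The cell below is a minimal sketch that uses the standard `tarfile` module. The helper `extract_cifar_files()`, its parameters and the archive location `./cifar-10-python.tar.gz` are assumptions made for illustration and are not part of the original tutorial. The helper copies only the requested files next to the notebook, and the cell does nothing if the archive is not found:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import shutil\n",
    "import tarfile\n",
    "\n",
    "def extract_cifar_files(archive_path, dest_dir='.', wanted=('batches.meta', 'data_batch_1')):\n",
    "    #Hypothetical helper: copy the wanted files from the archive into dest_dir\n",
    "    extracted = []\n",
    "    with tarfile.open(archive_path, 'r:gz') as tar:\n",
    "        for member in tar.getmembers():\n",
    "            name = os.path.basename(member.name)\n",
    "            if member.isfile() and name in wanted:\n",
    "                #Write the file directly into dest_dir, dropping the archive sub-folder\n",
    "                with open(os.path.join(dest_dir, name), 'wb') as target:\n",
    "                    shutil.copyfileobj(tar.extractfile(member), target)\n",
    "                extracted.append(name)\n",
    "    return sorted(extracted)\n",
    "\n",
    "#Assumed location of the downloaded archive\n",
    "archive_path = './cifar-10-python.tar.gz'\n",
    "if os.path.exists(archive_path):\n",
    "    print(extract_cifar_files(archive_path))"
   ]
  },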
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The `unpickle()` function returns a dictionary. Using `unpickle()` on the file *batches.meta* we obtain:\n",
    "\n",
    "**label_names**: a 10-element list which gives meaningful names to the numeric labels in the `labels` array described below. For example, `label_names[0] == \"airplane\"`, `label_names[1] == \"automobile\"`, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "my_dict = unpickle(\"./batches.meta\")\n",
    "labels_names = my_dict[\"label_names\"]\n",
    "print(labels_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The various *data_batch_x* files contain the following arrays:\n",
    "\n",
    "**data**: numpy array (shape: 10000x3072) of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.\n",
    "\n",
    "**labels**: list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.\n",
    "\n",
    "To continue this tutorial you need to open the CIFAR-10 archive, extract the file *data_batch_1* and copy it into the same folder as this notebook. Once this is done you can proceed with the next step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "my_dict = unpickle(\"./data_batch_1\")\n",
    "for key in my_dict.keys():\n",
    "    print(key)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pre-processing in Numpy\n",
    "----------------------------\n",
    "\n",
    "Using **numpy** and **matplotlib** it is possible to pick a random image from the training set and show it. In this example I take the three colour channels of the image, reshape each of them to 32x32 and stack them along the last axis. The resulting image is then displayed through matplotlib:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from matplotlib import pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "training_set = my_dict[\"data\"]\n",
    "random_int = np.random.randint(0,10000)\n",
    "image_array = training_set[random_int]\n",
    "image_R = np.reshape(image_array[0:1024], (32,32))\n",
    "image_G = np.reshape(image_array[1024:2048], (32,32))\n",
    "image_B = np.reshape(image_array[2048:3072], (32,32))\n",
    "image = np.dstack((image_R, image_G, image_B))\n",
    "\n",
    "label = my_dict[\"labels\"][random_int]\n",
    "label_name = labels_names[label]\n",
    "\n",
    "print(label_name)\n",
    "plt.imshow(image)\n",
    "plt.show()"
   ]
  },
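  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three slices used above can also be written as a single reshape followed by a transpose. The cell below is a small sketch of this equivalent formulation; `row_to_image()` is a name introduced here for illustration, and the check is done on a synthetic row rather than on the real data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def row_to_image(row):\n",
    "    #(3072,) -> (3, 32, 32) channel-first -> (32, 32, 3) channel-last\n",
    "    return np.reshape(row, (3, 32, 32)).transpose(1, 2, 0)\n",
    "\n",
    "#Synthetic row standing in for one row of the data array\n",
    "fake_row = (np.arange(3072) % 256).astype(np.uint8)\n",
    "#Same image built channel by channel, as in the previous cell\n",
    "stacked = np.dstack((np.reshape(fake_row[0:1024], (32, 32)),\n",
    "                     np.reshape(fake_row[1024:2048], (32, 32)),\n",
    "                     np.reshape(fake_row[2048:3072], (32, 32))))\n",
    "print(np.array_equal(row_to_image(fake_row), stacked))"
   ]
  },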
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pre-processing in Tensorflow\n",
    "---------------------------------\n",
    "\n",
    "In **tensorflow** the dataset can be managed through the `tf.data` API (the code in this notebook uses the TensorFlow 1.x API). If all of your input data fit in memory (as in this case), the simplest way to create a dataset from them is to convert them to `tf.Tensor` objects and use `Dataset.from_tensor_slices()`. A `tf.data.Dataset` represents a sequence of elements, in which each element contains one or more Tensor objects. For example, in an image-based pipeline, an element might be a single training example, with a pair of tensors representing the image and a label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "tf_dataset = tf.data.Dataset.from_tensor_slices((my_dict[\"data\"], my_dict[\"labels\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using tensorflow it is also possible to apply a **map function** to the dataset. The map function allows us to reshape each image and to create a one-hot target vector from a single integer label. The map is applied to every element stored in the dataset and uses an external function to parse the data. Inside the parse function it is possible to manipulate the image (e.g. standardisation, rotation, saturation, etc.) using the `tf.image` utilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def _parse_function(image_array, label):\n",
    "    label = tf.cast(label, tf.int32)\n",
    "    image_R = tf.reshape(image_array[0:1024], [32, 32])\n",
    "    image_G = tf.reshape(image_array[1024:2048], [32, 32])\n",
    "    image_B = tf.reshape(image_array[2048:3072], [32, 32])\n",
    "    image = tf.stack([image_R, image_G, image_B], axis=2)\n",
    "    #Image manipulation (e.g. through tf.image) can be applied here\n",
    "    label_one_hot = tf.one_hot(label, depth=10)\n",
    "    return image, label_one_hot\n",
    "\n",
    "tf_dataset = tf_dataset.map(_parse_function) #parsing all the elements\n",
    "tf_dataset = tf_dataset.shuffle(10000) #shuffling all the elements\n",
    "print(tf_dataset.output_types)\n",
    "print(tf_dataset.output_shapes)"
   ]
  },
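  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the one-hot target concrete, the cell below sketches in plain numpy an encoding equivalent to the one requested with `tf.one_hot(label, depth=10)`. The helper `one_hot()` and the example labels are assumptions introduced for illustration; the cell does not use tensorflow:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def one_hot(class_indices, depth=10):\n",
    "    #Row i of the identity matrix is the one-hot vector of class i\n",
    "    return np.eye(depth, dtype=np.float32)[np.asarray(class_indices)]\n",
    "\n",
    "#Three hypothetical labels in the range 0-9\n",
    "encoded = one_hot([0, 3, 9])\n",
    "print(encoded)\n",
    "#argmax recovers the integer label from the one-hot vector\n",
    "print(np.argmax(encoded, axis=1))"
   ]
  },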
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have built a `Dataset` to represent your input data, the next step is to create an `Iterator` to access elements from that dataset. The `tf.data` API currently supports the following iterators, in increasing level of sophistication:\n",
    "\n",
    "- one-shot\n",
    "- initializable\n",
    "- reinitializable\n",
    "- feedable\n",
    "\n",
    "A **one-shot** iterator is the simplest form of iterator, which only supports iterating once through a dataset, with no need for explicit initialization. One-shot iterators handle almost all of the cases that the existing queue-based input pipelines support. It can be created using the method `make_one_shot_iterator()`. The iterator can be combined with the `batch()` method in order to get a **batch** of elements that we can use to feed the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "\n",
    "batch_size = 32\n",
    "tf_batch = tf_dataset.batch(batch_size)\n",
    "print(tf_batch.output_types)\n",
    "print(tf_batch.output_shapes)\n",
    "\n",
    "#create the iterator over the batched dataset\n",
    "iterator = tf_batch.make_one_shot_iterator()\n",
    "#every time next_element is evaluated, batch_size new elements are taken\n",
    "next_element = iterator.get_next()\n",
    "\n",
    "with tf.Session() as sess:\n",
    "    #print(sess.run(next_element))\n",
    "    #print(sess.run(next_element))\n",
    "    features, labels = sess.run(next_element) #this is how you can get the features and labels back"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To be sure that the previous operations did not corrupt the dataset, we can display a random image taken from the batch and inspect it. We can also print the associated label to see if it matches:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "random_int = np.random.randint(batch_size)\n",
    "image = features[random_int,:,:,:]\n",
    "label = np.argmax(labels[random_int])\n",
    "label_name = labels_names[label]\n",
    "print(label_name)\n",
    "plt.imshow(image)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Complete pipeline in Tensorflow\n",
    "-------------------------------------\n",
    "\n",
    "Now it is time to stitch everything together and create a tensorflow pipeline to load the entire CIFAR-10 dataset and use it for training a model. This section is separated from the previous parts and can run independently from the other cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "\n",
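    "def unpickle(file_path):\n",
    "    import cPickle  #Python 2 module (this notebook runs on a Python 2 kernel)\n",
    "    #Open the file in binary mode and deserialize the pickled dictionary\n",
    "    with open(file_path, 'rb') as fo:\n",
    "        data_dict = cPickle.load(fo)\n",
    "    return data_dict\n",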
    "\n",
    "def _parse_function(image_array, label):\n",
    "    label = tf.cast(label, tf.int32)\n",
    "    image_R = tf.reshape(image_array[0:1024], [32, 32])\n",
    "    image_G = tf.reshape(image_array[1024:2048], [32, 32])\n",
    "    image_B = tf.reshape(image_array[2048:3072], [32, 32])\n",
    "    image_stack = tf.stack([image_R, image_G, image_B], axis=2)\n",
    "    #Here the image is standardised through tf.image\n",
    "    image = tf.image.per_image_standardization(image_stack)\n",
    "    label_one_hot = tf.one_hot(label, depth=10)\n",
    "    return image, label_one_hot\n",
    "\n",
    "#Load the test set\n",
    "print(\"Processing test set...\")\n",
    "my_dict = unpickle(\"./test_batch\")\n",
    "tf_test_dataset = tf.data.Dataset.from_tensor_slices((my_dict[\"data\"], my_dict[\"labels\"]))\n",
    "#Concatenate all the 5 batches in a single training dataset\n",
    "print(\"Processing batch 1...\")\n",
    "my_dict = unpickle(\"./data_batch_1\")\n",
    "tf_train_dataset = tf.data.Dataset.from_tensor_slices((my_dict[\"data\"], my_dict[\"labels\"]))\n",
    "for i in range(2,6):\n",
    "    print(\"Processing batch \" + str(i) + \"...\")\n",
    "    my_dict = unpickle(\"./data_batch_\" + str(i))\n",
    "    batch_dataset = tf.data.Dataset.from_tensor_slices((my_dict[\"data\"], my_dict[\"labels\"]))\n",
    "    #concatenate() returns a new dataset, so the result must be assigned back\n",
    "    tf_train_dataset = tf_train_dataset.concatenate(batch_dataset)\n",
    "\n",
    "#Parsing all the elements\n",
    "print(\"Parsing the datasets...\")\n",
    "tf_test_dataset = tf_test_dataset.map(_parse_function)\n",
    "tf_train_dataset = tf_train_dataset.map(_parse_function)\n",
    "\n",
    "#Shuffling the training set (the buffer covers all the 50000 training images)\n",
    "print(\"Shuffling train dataset...\")\n",
    "tf_train_dataset = tf_train_dataset.shuffle(buffer_size=50000)\n",
    "print(\"done!\")\n",
    "\n",
    "#Printing type and shape of the dataset\n",
    "print(\"Test dataset:\")\n",
    "print(tf_test_dataset.output_types)\n",
    "print(tf_test_dataset.output_shapes)\n",
    "print(\"Train dataset:\")\n",
    "print(tf_train_dataset.output_types)\n",
    "print(tf_train_dataset.output_shapes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have **two datasets**, one for training and one for testing. Notice that in `_parse_function()` I applied `tf.image.per_image_standardization()`, which produces standardised `tf.float32` tensors. The **standardisation** linearly scales each image to have zero mean and unit variance. It is important to use a per-image standardisation instead of a batch or dataset standardisation. If the images are standardised using statistics computed over a large set, we are leaking information from that set into each sample. Moreover, once the model is trained we won't be able to compute the same statistics over all of our future data points."
   ]
  },
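  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the standardisation concrete, the cell below sketches in plain numpy the computation described in the documentation of `tf.image.per_image_standardization()`: the mean of the image is subtracted and the result is divided by `max(stddev, 1/sqrt(N))`, where `N` is the number of elements in the image. The helper `standardize_image()` and the random image are assumptions introduced for illustration; this is not the tensorflow implementation itself:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def standardize_image(image):\n",
    "    image = image.astype(np.float32)\n",
    "    #The lower bound on the std avoids a division by zero on uniform images\n",
    "    adjusted_std = max(image.std(), 1.0 / np.sqrt(image.size))\n",
    "    return (image - image.mean()) / adjusted_std\n",
    "\n",
    "#Random uint8 image standing in for a CIFAR-10 sample\n",
    "fake_image = np.random.RandomState(0).randint(0, 256, size=(32, 32, 3)).astype(np.uint8)\n",
    "standardized = standardize_image(fake_image)\n",
    "#The mean should be close to 0 and the standard deviation close to 1\n",
    "print(standardized.mean())\n",
    "print(standardized.std())"
   ]
  },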
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "with tf.name_scope('dataset'):\n",
    "    batch_size = 32\n",
    "    num_epochs = 5\n",
    "    tf_train_dataset = tf_train_dataset.batch(batch_size)\n",
    "    tf_train_dataset = tf_train_dataset.repeat(num_epochs)\n",
    "    iterator = tf_train_dataset.make_one_shot_iterator()\n",
    "    next_batch_imgs, next_batch_labels = iterator.get_next()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are done with the graph and we can create a session to sample batches and use them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sess = tf.Session()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The iterator over the training dataset has already been declared above, so we can use the session to get the features and labels of a batch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "features, labels = sess.run([next_batch_imgs, next_batch_labels])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Attention: after the standardisation of the images, visualising them with Matplotlib and Numpy becomes problematic. The images are now standardised `float32` arrays and no longer `uint8`. For this reason you will see something weird if you try to plot the features (try the code below)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from matplotlib import pyplot as plt\n",
    "%matplotlib inline\n",
    "my_dict = unpickle(\"./batches.meta\")\n",
    "labels_names = my_dict[\"label_names\"]\n",
    "#random_int = np.random.randint(batch_size)\n",
    "image = features[0,:,:,:]\n",
    "label = np.argmax(labels[0])\n",
    "label_name = labels_names[label]\n",
    "print(features.shape)\n",
    "print(label_name)\n",
    "plt.imshow(image)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To solve the issue we can directly use the **tensorboard** image summary. It is very easy to write the images in the summary thanks to the `tf.summary.image()` method. This method takes as input a batch of images (a 4-D tensor) and displays them in the `IMAGES` tab of tensorboard.\n",
    "\n",
    "The code below saves a summary in `/tmp/log/`. To visualise the summary it is necessary to start tensorboard. From the terminal run the command `tensorboard --logdir=/tmp/log/` and then open the browser at the address printed in the console. If everything worked correctly you should see some images under the `IMAGES` tab. You can resize them using the checkbox `Show actual image size` in the top-left corner.\n",
    "\n",
    "First of all we declare a placeholder that will store the features returned by a previous evaluation of `iterator.get_next()`. We need a placeholder because connecting the summary operation directly to the iterator output would make the iterator advance by one step every time the summary is evaluated. Secondly, we create a summary operation that takes the content of the placeholder. The `max_outputs` parameter sets the maximum number of images that will be displayed from the batch (default is 3). Thirdly, we create a summary writer in order to save the tensorflow graph and the summaries to disk. To make the graph clearer I put everything inside a name scope."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "with tf.name_scope('batch_visualisation'):\n",
    "    batch_placeholder = tf.placeholder(tf.float32, shape=(batch_size, 32, 32, 3))\n",
    "    summary_op = tf.summary.image(\"batch_images\", batch_placeholder, max_outputs=batch_size)\n",
    "    writer = tf.summary.FileWriter(\"/tmp/log/\", sess.graph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we can run the summary operation, which takes the latest images in the batch and writes them in the summary. You can repeat the following snippet multiple times, but in order to see new images you have to ask for a new batch using `features, labels = sess.run([next_batch_imgs, next_batch_labels])`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "summary = sess.run(summary_op, feed_dict={batch_placeholder: features})\n",
    "writer.add_summary(summary)"
   ]
  },
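  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a lightweight alternative to tensorboard, a standardised image can also be linearly rescaled to the range [0, 1], which is a range that `plt.imshow()` accepts for float arrays. The cell below is a sketch under that assumption; `rescale_for_display()` is a hypothetical helper and a random array stands in for one element of `features`. Note that this rescaling makes the image viewable but does not restore the original `uint8` colours:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def rescale_for_display(image):\n",
    "    image = np.asarray(image, dtype=np.float32)\n",
    "    value_range = image.max() - image.min()\n",
    "    if value_range == 0:\n",
    "        #A uniform image carries no contrast, return a black image\n",
    "        return np.zeros_like(image)\n",
    "    #Map the minimum to 0 and the maximum to 1\n",
    "    return (image - image.min()) / value_range\n",
    "\n",
    "#Random array standing in for a standardised image such as features[0]\n",
    "fake_standardized = np.random.RandomState(1).randn(32, 32, 3).astype(np.float32)\n",
    "displayable = rescale_for_display(fake_standardized)\n",
    "print(displayable.min())\n",
    "print(displayable.max())"
   ]
  },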
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pre-processing using TFRecords\n",
    "---------------------------------------\n",
    "\n",
    "Another approach is to convert whatever data you have into a supported format. This approach makes it easier to mix and match data sets and network architectures. The recommended format for TensorFlow is a **TFRecords** file. You write a little program that gets your data, stuffs it in an Example protocol buffer, serializes the protocol buffer to a string, and then writes the string to a TFRecords file.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import cPickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def pickle_to_tfrecord(pickle_list, output_file):\n",
    "    with tf.python_io.TFRecordWriter(output_file) as record_writer:\n",
    "        for input_file in pickle_list:\n",
    "            #Opening the pickle (return a dictionary)\n",
    "            with open(input_file, 'rb') as fo:\n",
    "                data_dict = cPickle.load(fo)\n",
    "            #Store the dictionary data and labels in two variables\n",
    "            data = data_dict['data']\n",
    "            labels = data_dict['labels']\n",
    "            #Each row of data is an image (each element is a byte)\n",
    "            #Each row of labels is an integer (the single element is an int64)\n",
    "            for i in range(len(labels)):\n",
    "                #Getting the data as train feature\n",
    "                bytes_feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[data[i].tobytes()]))\n",
    "                int64_feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[labels[i]]))\n",
    "                #Stuff the data in an Example buffer\n",
    "                example = tf.train.Example(features=tf.train.Features(feature={'image': bytes_feature,\n",
    "                                                                               'label': int64_feature}))\n",
    "                #Serialize example to string and write in tfrecords\n",
    "                record_writer.write(example.SerializeToString())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can use the function to read the CIFAR-10 pickle files and store them in a TFRecords file. I will do it twice: once for the training set and once for the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pickle_to_tfrecord([\"./data_batch_1\", \"./data_batch_2\", \"./data_batch_3\", \"./data_batch_4\", \"./data_batch_5\"],\n",
    "                  \"./cifar10_train.tfrecord\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pickle_to_tfrecord([\"./test_batch\"], \"./cifar10_test.tfrecord\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Having the two TFRecords files, we can use the utilities of `tf.data` again, and in particular the `TFRecordDataset` class. We can parse the data and recover the `uint8` image from the serialized `string` using the method `tf.decode_raw()`. We apply the standardisation as usual and we get the one-hot vector from the integer class label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def _parse_function(example_proto):\n",
    "    features = {\"image\": tf.FixedLenFeature((), tf.string, default_value=\"\"),\n",
    "              \"label\": tf.FixedLenFeature((), tf.int64, default_value=0)}\n",
    "    parsed_features = tf.parse_single_example(example_proto, features)\n",
    "    image_decoded = tf.decode_raw(parsed_features[\"image\"], tf.uint8) #char -> uint8\n",
    "    image_R = tf.reshape(image_decoded[0:1024], [32, 32])\n",
    "    image_G = tf.reshape(image_decoded[1024:2048], [32, 32])\n",
    "    image_B = tf.reshape(image_decoded[2048:3072], [32, 32])\n",
    "    image_stack = tf.stack([image_R, image_G, image_B], axis=2)\n",
    "    image = tf.image.per_image_standardization(image_stack)\n",
    "    label_one_hot = tf.one_hot(parsed_features[\"label\"], depth=10)\n",
    "    return image, label_one_hot\n",
    "\n",
    "print(\"Loading the datasets...\")\n",
    "tf_train_dataset = tf.data.TFRecordDataset(\"./cifar10_train.tfrecord\")\n",
    "print(\"Parsing the datasets...\")\n",
    "tf_train_dataset = tf_train_dataset.map(_parse_function)\n",
    "print(\"Verifying types and shapes...\")\n",
    "print(tf_train_dataset.output_types)\n",
    "print(tf_train_dataset.output_shapes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with tf.name_scope('dataset'):\n",
    "    batch_size = 32\n",
    "    num_epochs = 5\n",
    "    tf_train_dataset = tf_train_dataset.batch(batch_size)\n",
    "    tf_train_dataset = tf_train_dataset.repeat(num_epochs)\n",
    "    iterator = tf_train_dataset.make_one_shot_iterator()\n",
    "    next_batch_imgs, next_batch_labels = iterator.get_next()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sess = tf.Session()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "features, labels = sess.run([next_batch_imgs, next_batch_labels])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from matplotlib import pyplot as plt\n",
    "%matplotlib inline\n",
    "image = features[0,:,:,:]\n",
    "print(features.shape)\n",
    "plt.imshow(image)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
